class: title-slide, left, bottom # Combining a smooth information criterion with neural networks ---- ## **Andrew McInerney**, ** ** ### University of Limerick #### LMU, 07 July 2023 --- # Introduction -- .pull-left[ <img src="data:image/png;base64,#img/limerick-map.png" width="100%" style="display: block; margin: auto;" /> ] -- <img src="data:image/png;base64,#img/limerick-city.jpg" width="30%" style="display: block; margin: auto 0 auto auto;" /> <img src="data:image/png;base64,#img/king-johns.jpg" width="30%" style="display: block; margin: auto 0 auto auto;" /> --- # Background -- <img src="data:image/png;base64,#img/crt-logo.jpg" width="60%" style="display: block; margin: auto;" /> -- * Research: Neural networks from a statistical-modelling perspective -- <img src="data:image/png;base64,#img/packages.png" width="70%" style="display: block; margin: auto;" /> --- class: selectnn-slide # Model Selection <img src="data:image/png;base64,#img/modelsel.png" width="90%" style="display: block; margin: auto;" /> A Statistically-Based Approach to Feedforward Neural Network Model Selection (arXiv:2207.04248) --- class: selectnn-slide # Insurance: Model Selection ```r library(selectnn) nn <- selectnn(charges ~ ., data = insurance, Q = 8, n_init = 5) summary(nn) ``` -- ```{.bg-primary} ## [...] ## Number of input nodes: 4 ## Number of hidden nodes: 2 ## ## Value: 1218.738 ## Covariate Selected Delta.BIC ## smoker.yes Yes 2474.478 ## bmi Yes 919.500 ## age Yes 689.396 ## children Yes 13.702 ## [...] ``` --- class: interpretnn-slide # Interpreting FNNs Extend packages: **nnet**, **neuralnet**, **keras**, **torch** * Significance testing * Covariate-effect plots --- class: interpretnn-slide # Insurance: Model Summary ```r intnn <- interpretnn(nn) summary(intnn) ``` -- ```{.bg-primary} ## Coefficients: ## Weights | X^2 Pr(> X^2) ## age (0.19, -0.41***) | 24.1009 5.84e-06 *** ## sex.male (-0.25, 0.05.) 
| 3.6364 1.62e-01 ## bmi (-26.11***, -0.03*) | 14.7542 6.25e-04 *** ## children (0.16, -0.07***) | 13.1946 1.36e-03 ** ## smoker.yes (63.64***, -2.83***) | 62.8237 2.28e-14 *** ## region.northwest (-3.65., 0.03) | 3.4725 1.76e-01 ## region.southeast (-1.95*, 0.08*) | 7.8144 2.01e-02 * ## region.southwest (-1.27, 0.12**) | 9.1267 1.04e-02 * ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ``` --- class: interpretnn-slide # Insurance: Model Summary ```r plotnn(intnn) ``` -- <img src="data:image/png;base64,#img/plotnn.png" width="70%" style="display: block; margin: auto;" /> --- class: interpretnn-slide # Insurance: Covariate Effects ```r plot(intnn, conf_int = TRUE, which = c(1, 4)) ``` -- .pull-left[ <!-- --> ] -- .pull-right[ <!-- --> ] --- # Current Work -- <br> .pull-left[ <img src="data:image/png;base64,#img/kevin-meadhbh.png" width="100%" style="display: block; margin: auto;" /> ] -- .pull-right[ <img src="data:image/png;base64,#img/sic-publication.png" width="100%" style="display: block; margin: auto;" /> ] --- # Smooth Information Criterion $$ \text{IC} = -2\ell(\theta) + \lambda [\lVert \tilde\theta \rVert_0 + 1] $$ where `\(\lambda = \log(n)\)` for the BIC. 
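As a quick sanity check of this formula (a toy illustration, not from the original slides): for a linear model, taking `\(\lVert \tilde\theta \rVert_0\)` to be the number of regression coefficients, with the `\(+\,1\)` accounting for `\(\sigma\)`, reproduces R's built-in `BIC()`:

```r
# Toy check: IC = -2*loglik + log(n) * (no. of coefficients + 1)
# matches stats::BIC() for a fitted linear model.
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)

ll <- as.numeric(logLik(fit))                     # maximised log-likelihood
ic <- -2 * ll + log(n) * (length(coef(fit)) + 1)  # "+ 1" for sigma

all.equal(ic, BIC(fit))  # TRUE
```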
-- Rearrange as an IC-based penalized likelihood: `$$\ell^{\text{IC}}(\theta) = \ell(\theta) - \frac{\log(n)}{2} [\lVert \tilde\theta \rVert_{0} + 1]$$` --- # Smooth Information Criterion Introduce "smooth `\(L_0\)` norm": `$$\lVert \theta \rVert_{0, \epsilon} = \sum_{j=1}^p \phi_\epsilon (\theta_j)$$` where $$ \phi_\epsilon(\theta_j) = \frac{{\theta_j^2}}{\theta_j^2 + \epsilon^2} $$ --- # Smooth Information Criterion <img src="data:image/png;base64,#img/smooth-l0.png" width="80%" style="display: block; margin: auto;" /> --- # Motivation -- * Tuning parameter automatically selected in one step <br> -- * Computationally advantageous --- # `\(\epsilon\)`-telescoping -- * Optimal `\(\epsilon\)` is zero -- * Smaller `\(\epsilon\)` `\(\implies\)` less numerically stable -- * Start with larger `\(\epsilon\)`, and "telescope" through a decreasing sequence of `\(\epsilon\)` values using warm starts --- # Algorithm <img src="data:image/png;base64,#img/sic-algorithm.png" width="50%" style="display: block; margin: auto;" /> --- # Results <img src="data:image/png;base64,#img/sic-results.png" width="50%" style="display: block; margin: auto;" /> --- # R Package <img src="data:image/png;base64,#img/smoothic.png" width="70%" style="display: block; margin: auto;" /> --- # Extending to Neural Networks `$$\mathbb{E}(y) = \text{NN}(X, \theta)$$` -- where `$$\text{NN}(X, \theta) = \phi_o \left[ \gamma_0+\sum_{k=1}^q \gamma_k \phi_h \left( \sum_{j=0}^p \omega_{jk}x_{j}\right) \right]$$` --- # Extending to Neural Networks We can then formulate a **smooth** BIC-based penalized likelihood: -- `\begin{equation*} \ell^{\text{SIC}}(\theta) = \ell(\theta) - \frac{\log(n)}{2} [\lVert \tilde\theta \rVert_{0, \epsilon} + q + 1], \end{equation*}` -- where `\begin{equation*} \ell(\theta)= -\frac{n}{2}\log(2\pi\sigma^2)-\frac{1}{2\sigma^2}\sum_{i=1}^n(y_i-\text{NN}(x_i))^2 \end{equation*}` --- # Extending to Group Sparsity The smooth approximation of the `\(L_0\)` norm can be written for groups as 
$$ \phi_\epsilon(\theta^{(g)}) = \lvert \theta^{ (g) } \rvert \frac{ {\lVert \theta^{ (g) } \rVert}_2^2}{ {\lVert \theta^{ (g) } \rVert}_2^2 + \epsilon^2}. $$ --- # Group Sparsity -- ## Input-node penalization -- `\begin{equation*} \ell^{\text{IN-SIC}}(\theta) = \ell(\theta) - \frac{\log(n)}{2} \left[\sum_{j=1}^{p} \lVert \omega_{j} \rVert_{0, \epsilon} + \lVert \tilde\gamma \rVert_{0, \epsilon} + q + 1\right], \end{equation*}` where `\(\omega_{j} = (\omega_{j1},\omega_{j2},\dotsc,\omega_{jq})^T\)` --- # Group Sparsity -- ## Hidden-node penalization -- `\begin{equation*} \ell^{\text{HN-SIC}}(\theta) = \ell(\theta) - \frac{\log(n)}{2} \left[\sum_{k=1}^{q} \lVert \theta^{(k)} \rVert_{0, \epsilon} + q + 1\right], \end{equation*}` where `\(\theta^{(k)} = (\omega_{1k},\omega_{2k},\dotsc,\omega_{pk}, \gamma_k)^T\)` --- # Combined Penalty * Implement a group penalty and the single-parameter penalty in one optimization procedure * Start with group penalization and telescope through the `\(\epsilon\)` values until some predefined change point, `\(\tau\)` * Switch to single-parameter penalization for the remainder of the `\(\epsilon\)` values --- # Approaches * Single-parameter penalization * Input-node penalization * Hidden-node penalization * Combined approaches (perform group penalization initially and then switch to single-parameter penalization) --- # Preliminary Simulation <img src="data:image/png;base64,#img/nn-sim-plot.png" width="100%" style="display: block; margin: auto;" /> --- # Preliminary Results --- class: bigger # References * O’Neill, M. and Burke, K. (2023). Variable selection using a smooth information criterion for distributional regression models. *Statistics and Computing*, 33(3), 71. ### R Packages ```r devtools::install_github(c("andrew-mcinerney/selectnn", "andrew-mcinerney/interpretnn")) ```
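The `\(\epsilon\)`-telescoping procedure described earlier can be sketched in a few lines of base R. This is a toy linear-model version with `\(\sigma^2\)` fixed at 1, not the `smoothic` or neural-network implementation, and all variable names are illustrative:

```r
# Toy sketch of epsilon-telescoping: minimise the smooth-L0-penalised
# least-squares objective over a decreasing sequence of epsilon values,
# warm-starting each fit at the previous solution.
set.seed(1)
n <- 200
X <- cbind(1, matrix(rnorm(n * 5), n, 5))  # intercept + 5 covariates
beta_true <- c(1, 2, 0, 0, -1.5, 0)        # sparse truth
y <- X %*% beta_true + rnorm(n)

phi_eps <- function(theta, eps) theta^2 / (theta^2 + eps^2)  # smooth L0

obj <- function(beta, eps) {
  rss <- sum((y - X %*% beta)^2)
  # BIC-type penalty on the non-intercept coefficients (sigma^2 = 1)
  rss / 2 + (log(n) / 2) * sum(phi_eps(beta[-1], eps))
}

beta <- rep(0, ncol(X))                  # initial estimate
for (eps in 10^seq(0, -4, by = -0.5)) {  # telescope epsilon downwards
  beta <- optim(beta, obj, eps = eps, method = "BFGS")$par
}
round(beta, 2)
```

Each warm start keeps the optimisation numerically stable as `\(\epsilon\)` shrinks: by the final `\(\epsilon\)`, the truly zero coefficients are driven towards zero while the signal coefficients stay close to their unpenalised estimates.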
<font size="5.5">andrew-mcinerney</font>
<font size="5.5">@amcinerney_</font>
<font size="5.5">andrew.mcinerney@ul.ie</font>